Learning to Mix n-Step Returns: Generalizing λ-Returns for Deep Reinforcement Learning

Authors

  • Sahil Sharma
  • Girish Raguvir J
  • Srivatsan Ramesh
  • Balaraman Ravindran
Abstract

Reinforcement Learning (RL) can model complex behavior policies for goal-directed sequential decision-making tasks. A hallmark of RL algorithms is Temporal Difference (TD) learning: the value function of the current state is moved toward a bootstrapped target estimated from the next state's value function. λ-returns generalize beyond 1-step returns and strike a balance between Monte Carlo and TD learning methods. While λ-returns have been extensively studied in RL, they have received comparatively little attention in Deep RL. This paper's first contribution is an exhaustive benchmarking of λ-returns. Although mathematically tractable, the exponentially decaying weighting of n-step return targets used in λ-returns is a rather ad hoc design choice. Our second major contribution is a generalization of λ-returns called Confidence-based Autodidactic Returns (CAR), wherein the RL agent learns the weighting of the n-step returns in an end-to-end manner. This allows the agent to decide for itself how much weight to place on each n-step return target, whereas λ-returns restrict it to an exponentially decaying weighting scheme. Autodidactic returns can be used to improve any RL algorithm that uses TD learning. We empirically demonstrate that sophisticated weighted mixtures of multi-step returns (such as CAR and λ-returns) considerably outperform plain n-step returns. Our experiments use the Asynchronous Advantage Actor-Critic (A3C) algorithm in the Atari 2600 domain.
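For concreteness: the n-step return is G_t^(n) = r_t + γ r_{t+1} + ... + γ^{n-1} r_{t+n-1} + γ^n V(s_{t+n}), and the λ-return mixes all of these with fixed weights (1-λ) λ^{n-1}, whereas CAR lets the network output the mixture weights itself. Below is a minimal NumPy sketch of the two weighting schemes; the confidence_logits head and its softmax normalization are illustrative assumptions, not a transcription of the paper's exact architecture.

```python
import numpy as np

def n_step_returns(rewards, values, gamma):
    """All n-step returns from time step t = 0.

    rewards: r_0 .. r_{T-1} along one trajectory segment.
    values:  bootstrap estimates V(s_1) .. V(s_T).
    Entry n-1 holds G^(n) = sum_{k<n} gamma^k r_k + gamma^n V(s_n).
    """
    T = len(rewards)
    returns = np.empty(T)
    discounted_sum = 0.0
    for n in range(1, T + 1):
        discounted_sum += gamma ** (n - 1) * rewards[n - 1]
        returns[n - 1] = discounted_sum + gamma ** n * values[n - 1]
    return returns

def lambda_return(n_step, lam):
    """Exponentially decaying mixture: weight (1-lam) * lam^(n-1) on the
    n-step return, with the remaining tail mass folded into the longest one."""
    T = len(n_step)
    weights = (1 - lam) * lam ** np.arange(T)
    weights[-1] = lam ** (T - 1)  # leftover probability mass; weights sum to 1
    return np.dot(weights, n_step)

def autodidactic_return(n_step, confidence_logits):
    """CAR-style target: one learned logit per n-step return (hypothetical
    network head); a softmax turns them into normalized mixture weights."""
    w = np.exp(confidence_logits - confidence_logits.max())
    w /= w.sum()
    return np.dot(w, n_step)
```

Setting lam=0 recovers the 1-step TD target and lam → 1 approaches the Monte Carlo return; CAR instead learns, per state, where to sit on this spectrum.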


Similar Articles

Distributed Distributional Deterministic Policy Gradients

This work adopts the very successful distributional perspective on reinforcement learning and adapts it to the continuous control setting. We combine this within a distributed framework for off-policy learning in order to develop what we call the Distributed Distributional Deep Deterministic Policy Gradient algorithm, D4PG. We also combine this technique with a number of additional, simple impr...


CFQI: Fitted Q-Iteration with Complex Returns

Fitted Q-Iteration (FQI) is a popular approximate value iteration (AVI) approach that makes effective use of off-policy data. FQI uses a 1-step return value update which does not exploit the sequential nature of trajectory data. Complex returns (weighted averages of the n-step returns) use trajectory data more effectively, but have not been used in an AVI context because of off-policy bias. In ...
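A "complex return" here is simply a convex combination of the n-step returns along a trajectory. A self-contained sketch under that definition (the uniform default weighting is an illustrative placeholder; CFQI's actual off-policy weighting and bias handling are not reproduced):

```python
import numpy as np

def complex_return(rewards, values, gamma, weights=None):
    """Weighted average of n-step returns ("complex return").

    rewards: r_0 .. r_{T-1} from one trajectory segment.
    values:  bootstrap estimates V(s_1) .. V(s_T).
    weights: convex combination over horizons n = 1..T
             (uniform by default in this sketch).
    """
    T = len(rewards)
    if weights is None:
        weights = np.full(T, 1.0 / T)
    g, partial = 0.0, 0.0
    for n in range(1, T + 1):
        partial += gamma ** (n - 1) * rewards[n - 1]      # discounted reward sum up to step n
        g += weights[n - 1] * (partial + gamma ** n * values[n - 1])
    return g
```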


Improving Approximate Value Iteration with Complex Returns by Bounding

Approximate value iteration (AVI) is a widely used technique in reinforcement learning. Most AVI methods do not take full advantage of the sequential relationship between samples within a trajectory in deriving value estimates, due to the challenges in dealing with the inherent bias and variance in the n-step returns. We propose a bounding method which uses a negatively biased but relatively lo...


Beyond the One Step Greedy Approach in Reinforcement Learning

The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the latter evaluation stage, e.g., n-step and trace-based returns, have been analyzed in previous works. However, the case of multiple-step lookahead policy improvement, despite the recent increase in empirical evidence of its strength, has...


Deep Learning for Forecasting Stock Returns in the Cross-Section

Many studies have been undertaken by using machine learning techniques, including neural networks, to predict stock returns. Recently, a method known as deep learning, which achieves high performance mainly in image recognition and speech recognition, has attracted attention in the machine learning field. This paper implements deep learning to predict one-month-ahead stock returns in the cross-...



Journal:
  • CoRR

Volume: abs/1705.07445

Issue: -

Pages: -

Publication date: 2017